perf(vectors): batch per-file embeddings + offload parse/enrich by HumanBean17 · Pull Request #334 · HumanBean17/java-codebase-rag

HumanBean17 · 2026-06-20T20:43:20Z

Summary

The vectors stage (the embedding phase of init / increment / reprocess) underused the embedding model's batching. process_*_file embedded each chunk serially, so SentenceTransformerEmbedder (@coco.fn.as_async(batching=True, max_batch_size=64)) only ever saw batches of 1 — batching only groups concurrently in-flight calls, and a serial await loop keeps one in flight. Worst for increment (few changed files → no cross-file batching).

Levers (`java_index_flow_lancedb.py`)

AST by Opus #1 — concurrent per-file embedding (java/sql/yaml). asyncio.gather(*(embedder.embed(ch.text) for ch in chunks)) → one model.encode([...]) (up to 64) instead of N batch-of-1 encodes.
Tier 1 completion plan + proposals (B2a + B4 + B5) #2 — offload parse + enrich off the event loop (java). asyncio.to_thread(_parse_and_enrich_java, ...) so the loop keeps the embedder's batching queue fed while a file parses (matters under init's high file concurrency). splitter.split stays inline (shared Rust singleton).

Review-driven fixes (two code-review rounds)

ast_java.py: parse_java is now thread-safe via a per-thread Parser (threading.local()). Round 1 found that Tier 1 completion plan + proposals (B2a + B4 + B5) #2 made enrich_chunk concurrent, and enrich transitively calls parse_java (collect_annotation_meta_chain → _collect_annotation_decl_index), which used the @lru_cache singleton tree-sitter Parser — not thread-safe. Root-cause fix at the source covers every caller (flow, enrich, graph builder). Added tests/test_ast_java_thread_safety.py.
app_main: warm per-project enrich caches before fanning out to threads. Round 2 found that concurrent enrich caused a collect_annotation_meta_chain thundering herd (up to ~14 worker threads each cold-missing and redundantly walk+parseing the whole project at cold start). Warming once on the event loop eliminates it.

Correctness — verified clean

Two full review rounds (8 finder angles + 4 deep concurrency verifiers, with empirical 16-thread × 2000-parse stress tests). Zero remaining races: the per-thread Parser covers every parse_java call site; an exhaustive transitive walk of enrich_chunk's callees found no other non-thread-safe shared state (pure functions, immutable frozenset constants, read-only lru_cache returns, ast/chunks never mutated). declare_row buffers → commit is all-or-nothing via the cocoindex memo guard, so no partial writes on mid-file failure.

Behaviour / compatibility

Output is equivalent: same chunk texts → same normalized vectors; rows still declared in chunk order; uuids stay random. No schema/ontology/env change.
First run after upgrade reprocesses once (process_java_file is @coco.fn(memo=True), changed body invalidates memo entries), then returns to incremental.

Validation

ruff check . — clean.
.venv/bin/python -m pytest tests -q — 839 passed, 13 skipped (heavy-gated).
Heavy e2e JAVA_CODEBASE_RAG_RUN_HEAVY=1 ... test_lancedb_e2e.py — main path passes (full cocoindex update --full-reprocess + graph + MCP search).
test_lancedb_ignore_file_reduces_indexed_java_files fails — pre-existing on master (verified by stashing the change and reproducing the identical FileExistsError; the test's own mkdir collides with a tracked fixture file). Unrelated; left untouched per scope.

🤖 Generated with Claude Code

process_*_file embedded each chunk serially (one `await embedder.embed()` per chunk in a for-loop), so the batched SentenceTransformerEmbedder (max_batch_size=64) only ever saw batches of 1. This wasted its batching within a file and hurt `increment` most: few changed files means little cross-file concurrency, so embeddings ran one-at-a-time. #1 — embed all chunks of a file concurrently via asyncio.gather so the embedder groups them into one model.encode(...). Applied to java/sql/yaml. #2 — for java, move parse_java + per-chunk enrich_chunk off the event loop (asyncio.to_thread) so the loop keeps the embedder's batching queue fed while a file is being parsed. parse_java is serialized by _PARSE_LOCK because it reuses the lru-cached singleton tree-sitter Parser, whose parse() mutates state and is not thread-safe. splitter.split stays inline (shared Rust RecursiveSplitter singleton); enrich_chunk is pure-Python over the immutable AST (lru_cache reads are GIL-safe) and runs without the lock. Output is equivalent: same chunk texts → same normalized vectors; rows are still declared in chunk order; uuids stay random. No schema/ontology/env change. Because process_java_file is @coco.fn(memo=True), the changed body invalidates memo entries, so the first run after upgrade reprocesses (equivalent to a full reindex) — subsequent runs are incremental again. Verified: ruff clean; 838 light tests pass; heavy test_lancedb_e2e (real cocoindex update --full-reprocess + graph + search) passes. One heavy test (test_lancedb_ignore_file_reduces_indexed_java_files) fails, but it fails identically on clean master (FileExistsError in the test's own mkdir, unrelated to this change). Co-Authored-By: Claude <noreply@anthropic.com>

Code review of the vectors perf change found a race it introduced: _parse_and_enrich_java runs enrich_chunk outside _PARSE_LOCK on concurrent worker threads, and enrich_chunk -> collect_annotation_meta_chain (lru_cached, non-serializing) -> _collect_annotation_decl_index -> parse_java. On a cold cache, multiple threads hit parse_java concurrently on the shared @lru_cache tree-sitter Parser, whose parse() mutates state and is not thread-safe -> corrupt AST or native segfault. (Before the perf change, enrich ran on the single event-loop thread, so this was serialized.) Root-cause fix at the source: _parser() now returns a per-thread Parser (threading.local) instead of a process-wide lru_cached singleton. Each OS thread gets its own Parser; the Language is immutable and shared. This protects EVERY parse_java caller — the flow's direct parse, enrich's transitive parse, and build_ast_graph — and preserves parse parallelism rather than serializing it. Removes the now-redundant _PARSE_LOCK from the flow file. Added tests/test_ast_java_thread_safety.py: 16 threads × 60 parses must each match the single-threaded reference (locks the invariant in; a revert to a shared Parser would corrupt/crash here). Verified: ruff clean; 839 light tests pass (+1 new); heavy test_lancedb_e2e main path (full reprocess + graph + search) passes. The pre-existing test_lancedb_ignore_file_reduces_indexed_java_files failure is unchanged (its own mkdir bug, fails on master). Co-Authored-By: Claude <noreply@anthropic.com>

…hreads Second code-review round found a perf regression introduced by lever #2: making enrich_chunk concurrent meant the first wave of worker threads each cold-miss collect_annotation_meta_chain (lru_cached, non-serializing) and each redundantly walk+parse the ENTIRE project (~min(32,cpu+4)x redundant full-project parses at cold start, GIL-serialized into the critical path) — directly offsetting the embedding-batching win on large repos. Fix: warm collect_annotation_meta_chain + load_brownfield_overrides ONCE on the event-loop thread in app_main, before coco.mount_each fans files into worker threads. Key derivation mirrors enrich_chunk exactly so the warmed entries are the ones workers hit; with warming every worker gets a cache hit (lru_cache reads are thread-safe). Wrapped so a warm-up failure never breaks indexing (falls back to lazy per-thread misses = prior behavior). The same review round confirmed ZERO correctness races remain: the per-thread Parser covers every parse_java call site (incl. enrich's transitive parse), and an exhaustive walk of enrich_chunk's callees found no other non-thread-safe shared state. Verified: ruff clean; heavy test_lancedb_e2e main path passes. Co-Authored-By: Claude <noreply@anthropic.com>

HumanBean17 and others added 3 commits June 20, 2026 23:42

HumanBean17 merged commit 9b1326b into master Jun 21, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(vectors): batch per-file embeddings + offload parse/enrich#334

perf(vectors): batch per-file embeddings + offload parse/enrich#334
HumanBean17 merged 3 commits into
masterfrom
feat/vectors-stage-batching

HumanBean17 commented Jun 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

HumanBean17 commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Levers (java_index_flow_lancedb.py)

Review-driven fixes (two code-review rounds)

Correctness — verified clean

Behaviour / compatibility

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

HumanBean17 commented Jun 20, 2026 •

edited

Loading

Levers (`java_index_flow_lancedb.py`)